Effective Solutions for Managing Excess URLs and Optimizing Crawl Budget for a Client’s Website

Today’s SEO inquiry comes from Michal in Bratislava, who asks:

"I have a client with a website that generates URLs based on map location filters. When users move on the map, new URLs are created, which are not included in the sitemap. Consequently, there are over 700,000 URLs in the Search Console that are not indexed and are consuming crawl budget.

What is the best way to remove these URLs? My idea is to maintain the base location as ‘index, follow’ and switch the newly created filtered URLs for the surrounding area to ‘noindex, nofollow.’ Additionally, I plan to use canonicals pointing to the base location and disavow unwanted links."

Great question, Michal! Fortunately, implementing the solution is quite straightforward.

First, let’s put your situation in the context of similar scenarios in ecommerce and publishing so the answer benefits a wider audience. Then we’ll walk through your proposed strategies and finish with a recommended solution.

Understanding Crawl Budget and Parameter Issues

If you’re unfamiliar with the term "crawl budget," it refers to the finite number of pages Google and other search engines will crawl on your site before moving on. If that budget is wasted on low-value, thin, or non-indexable pages, important new or updated pages may go undiscovered and unindexed, costing you potential SEO traffic. That’s why optimizing crawl efficiency is crucial.

Michal’s situation illustrates how "thin" URLs proliferate when users apply filters. While these filtered views are valuable for the user experience, from an SEO perspective the location-based pages are the ones worth indexing. The same problem appears in both ecommerce and publishing.

For example, ecommerce sites often generate URLs with parameters from color, size, or other filter searches, which aid users but compete with collection pages, the "non-thin" alternatives.

Similarly, publishers may produce filtered results that compete with more authoritative category pages. These filtered results can get indexed through social media sharing or backlinks, consuming valuable crawl time.

To maximize crawl budget efficiency, such "thin" pages must be managed wisely.

Indexing vs. Crawling

Before discussing solutions, it’s important to differentiate indexing from crawling:

  • Crawling: The process of discovering new pages on a site.
  • Indexing: The process of adding pages to the search engine’s database to be displayed when relevant queries are made.

Pages might be crawled without being indexed. For effective use of crawl resources, proper strategies include:

Using Meta Robots or X-Robots-Tag

Michal suggested implementing an "index, follow" directive, which asks search engines to index a page and follow its links. However, for filtered results you don’t want indexed, consider "noindex, follow" instead: it tells search engines not to index the page while still allowing them to follow its links and continue crawling the site.
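As a sketch of what this looks like in the page markup (the split between base and filtered pages follows Michal’s setup; the comments describe hypothetical pages, not his actual URLs):

```html
<!-- On the base location page: allow indexing and link following.
     This is also the default behavior, so the tag is optional here. -->
<meta name="robots" content="index, follow">

<!-- On a filtered, map-generated URL: keep it out of the index,
     but let crawlers continue following its links. -->
<meta name="robots" content="noindex, follow">
```

The same directive can also be sent as an HTTP response header, `X-Robots-Tag: noindex, follow`, which is useful when you can’t edit the HTML or the resource isn’t an HTML page.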

Canonicals for Crawl Budget Optimization

Canonical links tell search engines which version of a page is the official one. If the same content is reachable at multiple URL versions, setting a canonical directs indexing signals toward the preferred page. This also covers cases where filtered or on-site search results duplicate official site pages.
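A minimal sketch of the tag as it might appear on a filtered variant, assuming a hypothetical base location URL (substitute the client’s real paths):

```html
<!-- Placed in the <head> of a filtered URL such as a map-generated
     variant of the Bratislava page (URLs are hypothetical examples). -->
<link rel="canonical" href="https://www.example.com/locations/bratislava/">
```

Keep in mind that a canonical is a hint rather than a directive, and the duplicate URL must still be crawled before the tag can be seen, so canonicals alone don’t conserve crawl budget.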

Disavowing Won’t Help Crawl Efficiency

While disavowing addresses spammy backlinks, it has no effect on crawl efficiency. Google’s disavow tool isn’t intended for your own site’s URLs, and it’s unrelated to crawling directives.

Efficiently Managing Crawl Budgets

The primary tool for optimizing crawl paths is robots.txt, which tells search engines which folders and URL parameters they may or may not crawl. This is where you disallow paths carrying unwanted parameters, conserving crawl resources for the pages that matter.
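For Michal’s map-filter case, a robots.txt sketch might look like the following. The parameter names (`lat`, `lng`) and the sitemap URL are hypothetical placeholders; use whatever parameters the client’s map actually generates:

```
# Hypothetical example: block crawling of map-filter parameters
# while leaving the base location pages crawlable.
User-agent: *
Disallow: /*?lat=
Disallow: /*&lat=
Disallow: /*?lng=
Disallow: /*&lng=

Sitemap: https://www.example.com/sitemap.xml
```

One caveat: a URL blocked in robots.txt can’t be crawled, so any "noindex" tag on it will never be seen. Don’t disallow a URL and rely on its meta robots tag at the same time; pick one mechanism per URL.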

Enhance these directives by considering internal linking structures—breadcrumbs, menus, in-content links, and sitemaps can guide crawlers to prioritized pages effectively.
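Since the sitemap should list only the base location pages Michal wants crawled and indexed, a minimal entry might look like this (the URL is a hypothetical example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- List only canonical base location pages; omit the
       map-generated filtered variants entirely. -->
  <url>
    <loc>https://www.example.com/locations/bratislava/</loc>
  </url>
</urlset>
```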

I hope this provides clarity to your question. It’s a common concern, so rest assured, you’re not alone.
